# 10. Summary

REINFORCE increases the probability of "good" actions and decreases the probability of "bad" actions. ([Source](https://blog.openai.com/evolution-strategies/))

### What are Policy Gradient Methods?

  • Policy-based methods are a class of algorithms that search directly for the optimal policy, without simultaneously maintaining value function estimates.
  • Policy gradient methods are a subclass of policy-based methods that estimate the weights of an optimal policy through gradient ascent.
  • In this lesson, we represent the policy with a neural network, where our goal is to find the weights \theta of the network that maximize expected return.
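
A minimal PyTorch sketch of such a policy network is shown below. The lesson does not prescribe an architecture, so the layer sizes, the default state/action dimensions, and the `act` helper are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Policy(nn.Module):
    """Maps a state to a probability distribution over a small set of discrete actions."""

    def __init__(self, state_size=4, action_size=2, hidden_size=16):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        return F.softmax(self.fc2(x), dim=-1)  # action probabilities

    def act(self, state):
        """Sample an action and return its log-probability (needed for the gradient estimate)."""
        probs = self.forward(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)
```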

### The Big Picture

  • The policy gradient method will iteratively amend the policy network weights to:
    • make (state, action) pairs that resulted in positive return more likely, and
    • make (state, action) pairs that resulted in negative return less likely.

### Problem Setup

  • A trajectory \tau is a state-action sequence s_0, a_0, \ldots, s_H, a_H, s_{H+1}.
  • In this lesson, we will use the notation R(\tau) to refer to the return corresponding to trajectory \tau.
  • Our goal is to find the weights \theta of the policy network to maximize the expected return U(\theta) := \sum_\tau \mathbb{P}(\tau;\theta)R(\tau).
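
As a concrete illustration of these definitions, the short sketch below computes R(\tau) from a trajectory's rewards and forms a Monte Carlo estimate of U(\theta) from a handful of sampled trajectories. The helper names and the optional discount factor are assumptions for illustration, not part of the lesson's notation.

```python
import numpy as np


def trajectory_return(rewards, gamma=1.0):
    """R(tau): the (optionally discounted) sum of rewards collected along one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def estimate_expected_return(sampled_reward_lists, gamma=1.0):
    """Monte Carlo estimate of U(theta): average R(tau) over trajectories sampled from pi_theta."""
    return np.mean([trajectory_return(rewards, gamma) for rewards in sampled_reward_lists])


# Two hypothetical trajectories with per-step rewards:
print(estimate_expected_return([[1.0, 1.0, 1.0], [1.0, 0.0]]))  # (3.0 + 1.0) / 2 = 2.0
```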

### REINFORCE

  • The pseudocode for REINFORCE is as follows:
  1. Use the policy \pi_\theta to collect m trajectories \{ \tau^{(1)}, \tau^{(2)}, \ldots, \tau^{(m)} \} with horizon H. We refer to the i-th trajectory as
    \tau^{(i)} = (s_0^{(i)}, a_0^{(i)}, \ldots, s_H^{(i)}, a_H^{(i)}, s_{H+1}^{(i)}).
  2. Use the trajectories to estimate the gradient \nabla_\theta U(\theta):
    \nabla_\theta U(\theta) \approx \hat{g} := \frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) R(\tau^{(i)})
  3. Update the weights of the policy:
    \theta \leftarrow \theta + \alpha \hat{g}
  4. Loop over steps 1-3.
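
Putting steps 1-3 together, here is a minimal sketch of one REINFORCE iteration in PyTorch. It reuses the `Policy` class sketched earlier, assumes a hypothetical `env` that follows the classic Gym interface (`reset()` returns a state; `step(action)` returns `(state, reward, done, info)`), and treats R(\tau) as the undiscounted sum of rewards; the helper names are illustrative, not from the lesson.

```python
import torch
import torch.optim as optim


def collect_trajectory(env, policy, max_t=1000):
    """Roll out one trajectory with the current policy (step 1)."""
    log_probs, rewards = [], []
    state = env.reset()
    for _ in range(max_t):
        action, log_prob = policy.act(state)
        state, reward, done, _ = env.step(action)
        log_probs.append(log_prob)
        rewards.append(reward)
        if done:
            break
    return log_probs, rewards


def reinforce_update(env, policy, optimizer, m=10):
    """Collect m trajectories, estimate g_hat, and take one gradient-ascent step (steps 1-3)."""
    losses = []
    for _ in range(m):
        log_probs, rewards = collect_trajectory(env, policy)
        R = sum(rewards)  # R(tau): total return of this trajectory
        # Minimizing -R(tau) * sum_t log pi(a_t|s_t) is gradient *ascent* on U(theta).
        losses.append(-R * torch.stack(log_probs).sum())
    loss = torch.stack(losses).mean()  # the 1/m average over trajectories
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# Usage (env is a hypothetical Gym-style environment with a discrete action space):
# policy = Policy(state_size=4, action_size=2)
# optimizer = optim.Adam(policy.parameters(), lr=1e-2)
# for _ in range(500):
#     reinforce_update(env, policy, optimizer)
```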

### Derivation

  • We derived the likelihood ratio policy gradient: \nabla_\theta U(\theta) = \sum_\tau \mathbb{P}(\tau;\theta)\nabla_\theta \log \mathbb{P}(\tau;\theta)R(\tau) .
  • We can approximate the gradient above with a sample-based estimate, averaging over m sampled trajectories:
    \nabla_\theta U(\theta) \approx \frac{1}{m}\sum_{i=1}^m \nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta)R(\tau^{(i)}).
  • Because the transition dynamics do not depend on \theta, we calculated the following (see the worked expansion below):
    \nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta) = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta (a_t^{(i)}|s_t^{(i)}).
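
For completeness, the two results above can be expanded as follows. This is only a sketch: \mu(s_0) denotes the start-state distribution, which, like the transition dynamics, does not depend on \theta.

```latex
\begin{align*}
\nabla_\theta U(\theta)
  &= \sum_\tau \nabla_\theta \mathbb{P}(\tau;\theta)\, R(\tau)
   = \sum_\tau \mathbb{P}(\tau;\theta)\,
     \frac{\nabla_\theta \mathbb{P}(\tau;\theta)}{\mathbb{P}(\tau;\theta)}\, R(\tau)
   = \sum_\tau \mathbb{P}(\tau;\theta)\,
     \nabla_\theta \log \mathbb{P}(\tau;\theta)\, R(\tau), \\
\nabla_\theta \log \mathbb{P}(\tau;\theta)
  &= \nabla_\theta \log \Big[ \mu(s_0) \prod_{t=0}^{H}
       \mathbb{P}(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t) \Big]
   = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t).
\end{align*}
```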

### What's Next?

  • REINFORCE can solve Markov Decision Processes (MDPs) with either discrete or continuous action spaces.
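
The discrete-action sketch earlier uses a softmax over actions; for continuous action spaces a common choice (not prescribed by this lesson) is a Gaussian policy whose mean is produced by the network, with actions sampled from that distribution. A minimal sketch, with illustrative names and sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianPolicy(nn.Module):
    """Policy for a continuous action space: the network outputs the mean of a Gaussian,
    and the log standard deviation is a separate learned parameter."""

    def __init__(self, state_size=3, action_size=1, hidden_size=16):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.mu = nn.Linear(hidden_size, action_size)
        self.log_std = nn.Parameter(torch.zeros(action_size))

    def act(self, state):
        x = F.relu(self.fc1(torch.as_tensor(state, dtype=torch.float32)))
        dist = torch.distributions.Normal(self.mu(x), self.log_std.exp())
        action = dist.sample()
        # Sum over action dimensions so the log-probability is a scalar, as in the discrete case.
        return action.numpy(), dist.log_prob(action).sum()
```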